This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
Not all users receive the same offer, and that is the challenge to solve with this data set.
Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.
Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.
You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.
Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.
To give an example, a user could receive a "buy 10 dollars get 2 dollars off" discount offer on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.
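The completion rule in this example can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Purchase` class, the `completes_offer` helper and the sample purchases are hypothetical and not part of the data set, and times are measured in days since the offer was sent.

```python
from dataclasses import dataclass


@dataclass
class Purchase:
    day: float     # time of the purchase, in days (hypothetical unit)
    amount: float  # money spent, in dollars


def completes_offer(purchases, received_day, duration_days, difficulty):
    """Return True if spending inside the validity window reaches the difficulty."""
    window_end = received_day + duration_days
    spent = sum(p.amount for p in purchases
                if received_day <= p.day <= window_end)
    return spent >= difficulty


# "Buy 10 dollars get 2 dollars off", received on day 0 and valid for 10 days.
purchases = [Purchase(day=2, amount=4.0),
             Purchase(day=6, amount=7.5),
             Purchase(day=12, amount=20.0)]  # day 12 is outside the window
print(completes_offer(purchases, received_day=0, duration_days=10, difficulty=10))
```

Only the first two purchases fall inside the window, and together they exceed the 10 dollar threshold, so the offer counts as completed.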
However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off" offer but never open it during the 10-day validity period. If the customer spends 15 dollars during those ten days, there will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed it.
This makes data cleaning especially important and tricky.
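One way to separate influenced completions from accidental ones is sketched below. The function name `classify_offer` and its labels are hypothetical, and it assumes that a view only counts if it happens before the completion; that is one reasonable reading of the description, not a rule stated by the data set.

```python
def classify_offer(received_day, duration_days, viewed_day=None, completed_day=None):
    """Label one received offer by whether its completion was influenced by a view."""
    window_end = received_day + duration_days
    completed = (completed_day is not None
                 and received_day <= completed_day <= window_end)
    if not completed:
        return "not completed"
    # Assumption: the customer must have seen the offer before completing it.
    viewed_first = (viewed_day is not None
                    and received_day <= viewed_day <= completed_day)
    return "influenced" if viewed_first else "uninfluenced completion"


# Offer received on day 0 and valid for 10 days.
print(classify_offer(0, 10, viewed_day=2, completed_day=6))
print(classify_offer(0, 10, viewed_day=None, completed_day=6))
print(classify_offer(0, 10, viewed_day=8, completed_day=6))
print(classify_offer(0, 10))
```

The second and third calls correspond to the situation described above: a completion record exists, but the customer either never opened the offer or opened it only after the threshold had been reached.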
You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.
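A baseline of this kind could be estimated roughly as follows. The table, its column names (`had_offer`, `amount_spent`) and all values are made up for illustration; the real data would first have to be reshaped into a comparable one-row-per-customer form.

```python
import pandas as pd

# Hypothetical toy data: one row per customer, with the total spend in a period
# and a flag saying whether the customer held any active offer in that period.
spend = pd.DataFrame({
    "gender":       ["F", "F", "M", "M", "F", "M"],
    "age_range":    ["30-39", "30-39", "30-39", "30-39", "50-59", "50-59"],
    "had_offer":    [False, True, False, True, False, False],
    "amount_spent": [12.0, 18.0, 6.0, 9.0, 20.0, 10.0],
})

# Baseline: average spend of customers who were NOT under any offer.
baseline = (spend[~spend["had_offer"]]
            .groupby(["gender", "age_range"])["amount_spent"]
            .mean()
            .rename("baseline_spend"))

# A "spend 10 dollars" offer adds little for groups whose baseline already reaches 10.
difficulty = 10
already_spending = baseline[baseline >= difficulty]
print(baseline)
print(already_spending.index.tolist())
```

Groups that show up in `already_spending` are candidates for not receiving the discount, since they tend to reach the threshold without any incentive.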
Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (e.g., if 75 percent of 35-year-old women responded to offer A versus 40 percent of the same demographic to offer B, send offer A).
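The heuristic approach can be sketched with a small response-rate table. The segments, offer names and outcomes below are invented for illustration; only the idea of choosing the offer with the highest observed response rate per segment comes from the text.

```python
import pandas as pd

# Hypothetical outcomes: one row per (customer, offer) pair that was sent.
outcomes = pd.DataFrame({
    "segment":   ["F 30-39"] * 4 + ["M 30-39"] * 4,
    "offer":     ["A", "A", "B", "B", "A", "A", "B", "B"],
    "responded": [1, 1, 1, 0, 0, 1, 1, 1],
})

# Response rate of every segment to every offer (rows: segment, columns: offer).
rates = outcomes.pivot_table(index="segment", columns="offer",
                             values="responded", aggfunc="mean")

# Heuristic: send each segment the offer with the highest observed response rate.
best_offer = rates.idxmax(axis=1)
print(rates)
print(best_offer.to_dict())
```

With real data the segments would need enough observations per offer for the rates to be trustworthy; a rule based on a handful of customers is mostly noise.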
import numpy as np
import pandas as pd
import math
import json
import datetime
%matplotlib inline
# Libraries for plotting
import seaborn as sns
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import plotly.offline as py
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.tools as tls
import plotly.figure_factory as ff
#Import time
import time
#Ignore warning messages
import warnings
warnings.filterwarnings('ignore')
style_dict = {'background-color':'lightyellow','color':'#000000','border-color': 'red','font-family':'Roboto'}
Data source: Starbucks capstone project
# Read in the json files
portfolio = pd.read_json('portfolio.json', orient='records', lines=True)
profile = pd.read_json('profile.json', orient='records', lines=True)
transcript = pd.read_json('transcript.json', orient='records', lines=True)
The first file, portfolio.json, contains offer ids and metadata about each offer (duration, type, etc.).
portfolio
portfolio.shape
print("So, there were", len(portfolio), "types of offers in the given dataset.")
# Let's do some data preprocessing for better data handling and calculation
def portfolio_cleaning(portfolio):
    '''
    Function to clean portfolio dataframe
    INPUT:
    portfolio - uncleaned portfolio dataframe
    OUTPUT:
    portfolio - cleaned portfolio dataframe with one-hot encoded offer types and channels
    '''
    # One-hot encode the offer type, keeping the original column for plotting
    portfolio_offer_type = pd.get_dummies(portfolio['offer_type'])
    portfolio = portfolio.join(portfolio_offer_type)
    # One-hot encode the list-valued channels column (one indicator column per channel).
    # groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0.
    channels = pd.get_dummies(portfolio['channels'].explode()).groupby(level=0).sum()
    portfolio = portfolio.join(channels)
    portfolio = portfolio.drop(columns='channels')
    portfolio = portfolio[['id', 'reward', 'difficulty', 'duration', 'offer_type', 'bogo', 'discount', 'informational',
                           'email', 'mobile', 'social', 'web']].rename(columns={'id': 'offer_id'})
    return portfolio
portfolio_cleaned = portfolio_cleaning(portfolio)
portfolio_cleaned
sns.countplot(x=portfolio['offer_type'])
plt.title('Number of different types of offer')
plt.ylabel('Counts')
plt.xlabel('Offer Type')
plt.show()
The second file contains customer demographic data including their age, gender, income, and when they created an account on the Starbucks rewards mobile application.
profile.head()
def profile_cleaning(profile):
    '''
    Function to clean profile dataframe (the dataframe is modified in place)
    INPUT:
    profile - uncleaned profile dataframe
    OUTPUT:
    profile - cleaned profile dataframe
    '''
    # Replacing the age value 118 with NaN (treated as missing)
    profile['age'] = profile['age'].replace(118, np.nan)
    # Removing all the rows with a missing age
    profile.dropna(subset=['age'], inplace=True)
    # Creating a new column from "became_member_on" with the number of days the user has been a Starbucks member
    member_since = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')
    profile['memberdays'] = (pd.Timestamp.today().normalize() - member_since).dt.days
    return profile
profile_cleaned = profile_cleaning(profile)
profile_cleaned.head()
# Draw four subplots in a 2x2 grid
fig, ax = plt.subplots(figsize=(15, 10), nrows=2, ncols=2)
# Plot the age distribution
plt.sca(ax[0,0])
sns.distplot(profile_cleaned.age)
plt.xlabel('Age')
plt.title('Age Distribution')
# Plot the income distribution
plt.sca(ax[0,1])
sns.distplot(profile_cleaned.income)
plt.xlabel('Income')
plt.title('Income Distribution')
# Plot the number of customers per gender
plt.sca(ax[1,0])
sns.countplot(x=profile_cleaned.gender)
plt.title('Gender Distribution')
# Plot the number of customers per membership start year
plt.sca(ax[1,1])
sns.countplot(x=pd.to_datetime(profile_cleaned['became_member_on'].astype(str), format='%Y%m%d').dt.year)
plt.xlabel('Year')
plt.title('Year Distribution')
plt.show()
male_customers = profile_cleaned[profile_cleaned['gender'] == "M"]
female_customers = profile_cleaned[profile_cleaned['gender'] == 'F']
other_customers = profile_cleaned[profile_cleaned['gender'] == 'O']
sns.set(style="darkgrid")
# One income distribution per gender, side by side
fig, ax = plt.subplots(1,3, figsize=(12, 6))
sns.distplot(female_customers['income'], ax=ax[0]).set(title = 'Female Customers')
sns.distplot(male_customers['income'], ax=ax[1]).set(title = 'Male Customers')
sns.distplot(other_customers['income'], ax=ax[2]).set(title = 'Other Customers')
fig.tight_layout()
plt.show()
The third file describes customer purchases and when they received, viewed, and completed an offer. An offer is only successful when a customer both views an offer and meets or exceeds its difficulty within the offer's duration.
transcript.head(4)
transcript.shape
def offer_id(value):
    # Return the offer id stored in the value dict, or 0 if the event has none.
    # The key is spelled either "offer id" or "offer_id" depending on the record.
    if "offer id" in value:
        return value["offer id"]
    if "offer_id" in value:
        return value["offer_id"]
    return 0
def amount(value):
    # Return the transaction amount stored in the value dict, or 0 if there is none
    return value.get("amount", 0)
transcript["offer_id"] = transcript["value"].apply(offer_id)
transcript["amount_spent"] = transcript["value"].apply(amount)
# Convert the event time to days (dividing by 24 assumes it is recorded in hours)
transcript['time'] = transcript['time']/24.0
# One indicator column per event type
event_type = pd.get_dummies(transcript.event)
transcript = transcript.join(event_type)
transcript.drop(columns="value", inplace=True)
transcript[transcript.event =="offer completed"].head()
len(transcript.person.unique())
def transaction_cleaning(transcript, num_customers):
    '''
    Build one row per (customer, offer) pair with event flags and spending
    INPUT:
    transcript - transcript dataframe with offer_id, amount_spent and event indicator columns
    num_customers - number of customers to process
    OUTPUT:
    transactions_df - dataframe with one row per customer and offer
    '''
    columns = ['person', 'offer_id', 'time', 'offer_received', 'offer_viewed',
               'offer_completed', 'offer_successful', 'transaction', 'amount_spent']
    # Get the unique customer ids
    customer_ids_list = list(transcript.person.unique())
    # Collect the rows in a list and build the dataframe once at the end
    # (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
    rows = []
    for customer_id in customer_ids_list[:num_customers]:
        # Get the transcript info about the customer
        customer_df = transcript[transcript.person == customer_id]
        # Start each customer's spending window at time 0
        offer_start_time = 0
        # Get the list of offers this customer received/viewed/completed (0 marks plain transactions)
        offers_id_list = customer_df.offer_id.drop_duplicates()
        for offer_id in offers_id_list:
            # Events of this customer for this particular offer
            customer_df_1 = customer_df[customer_df['offer_id'] == offer_id]
            # Time (in days) of the last event recorded for this offer
            cur_time_days = customer_df_1['time'].max()
            # Collect event's info: offer received, offer viewed, and offer completed
            offer_completed = customer_df_1['offer completed'].max()
            offer_received = customer_df_1['offer received'].max()
            offer_viewed = customer_df_1['offer viewed'].max()
            # Money spent after the previous offer's last event, up to this offer's last event
            in_window = (customer_df['time'] > offer_start_time) & (customer_df['time'] <= cur_time_days)
            amount_spent = customer_df.loc[in_window, 'amount_spent'].sum()
            # Transaction done or not
            transaction_done = 1 if amount_spent else 0
            offer_start_time = cur_time_days
            # An offer is successful only if it was received, viewed and completed
            offer_successful = 1 if (offer_received and offer_viewed and offer_completed) else 0
            if cur_time_days >= 0:
                rows.append({'person': customer_id, 'offer_id': offer_id, 'time': cur_time_days,
                             'offer_received': offer_received, 'offer_viewed': offer_viewed, 'offer_completed': offer_completed,
                             'offer_successful': offer_successful, 'transaction': transaction_done, 'amount_spent': amount_spent})
    transactions_df = pd.DataFrame(rows, columns=columns)
    return transactions_df
transcript_cleaned = transaction_cleaning(transcript,17000)
transcript_cleaned
# Merging the data frames
temp_merged = transcript_cleaned.merge(profile_cleaned, how="inner", left_on="person", right_on="id")
temp_merged.drop("id", inplace=True, axis=1)
temp_merged = temp_merged.merge(portfolio_cleaned, how="left", left_on="offer_id", right_on="offer_id")
temp_merged.head()
There are three types of offers: BOGO (buy one get one free), discount, and informational.
In this project, we have to find out which type of offer suits each demographic group, based on age, income and gender.
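The per-group comparisons in this section all follow the same pattern: count the "offer completed" events of a group and divide by its "offer received" events. A compact helper for that ratio might look like the sketch below. The name `completion_rate` and the toy event log are hypothetical; the column names `event` and `age_range` mirror the merged dataframe used in this notebook.

```python
import pandas as pd


def completion_rate(events, group_col):
    """Percentage of received offers that were completed, per group."""
    received = events.loc[events["event"] == "offer received", group_col].value_counts()
    completed = events.loc[events["event"] == "offer completed", group_col].value_counts()
    # Align on the groups that received offers; groups without completions get 0.
    completed = completed.reindex(received.index, fill_value=0)
    return (completed / received * 100).sort_index()


# Hypothetical event log in the same shape as the merged transcript.
events = pd.DataFrame({
    "event": ["offer received"] * 5 + ["offer completed"] * 2,
    "age_range": ["18-29", "18-29", "30-39", "30-39", "40-49", "18-29", "30-39"],
})
print(completion_rate(events, "age_range"))
```

Reindexing before the division keeps groups that never completed an offer in the result with a rate of 0 instead of NaN. A rate computed this way still counts completions by customers who never viewed the offer.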
# Merge transcript and profile to get age, gender and income in the transcript dataset:
temp_df1 = transcript.merge(profile, left_on="person", right_on="id", how = 'left')
# Merge the result with portfolio to get the offer details (offer type, difficulty, etc.):
temp_df2 = temp_df1.merge(portfolio, left_on="offer_id", right_on= "id", how = 'left')
temp_df2.drop(["id_x","id_y"], axis=1,inplace=True)
# Create age and income ranges for a better understanding of the data distribution
# right=False makes each bin include its lower edge, so that e.g. age 30 falls into '30-39'
bins = [18, 30, 40, 50, 60, 70, 80, 90, 120]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70-79', '80-89', '90+']
temp_df2['age_range'] = pd.cut(temp_df2.age, bins, labels = labels, right = False)
# The last income bin is open-ended so that every label matches its interval
bins_income = [20000,30000,40000,50000,60000,70000,80000,90000,100000,np.inf]
labels_income = ['20k-30k','30k-40k','40k-50k','50k-60k','60k-70k','70k-80k','80k-90k','90k-100k','100k+']
temp_df2['income_range'] = pd.cut(temp_df2.income, bins_income, labels = labels_income, right = False)
temp_df2
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Create a 3x3 grid of pie charts: one row per offer type, one column per event
# ('domain' is the subplot type required for Pie traces)
gender_codes = ["M", "F", "O"]
labels = ["Male", "Female", "Other"]
colors = ['pink', 'mediumturquoise', 'darkorange']
offer_types = [("bogo", "BOGO"), ("discount", "Discount"), ("informational", "Informational")]
events = [("offer received", "Offer Received"), ("offer viewed", "Offer Viewed"), ("offer completed", "Offer Completed")]
fig = make_subplots(rows=3, cols=3,
                    specs=[[{'type': 'domain'} for _ in range(3)] for _ in range(3)],
                    subplot_titles=[f"{offer_name}-({event_name})"
                                    for _, offer_name in offer_types for _, event_name in events])
for row, (offer_type, offer_name) in enumerate(offer_types, start=1):
    for col, (event, event_name) in enumerate(events, start=1):
        subset = temp_df2[(temp_df2.offer_type == offer_type) & (temp_df2.event == event)]
        # Reindex so that the counts always line up with the Male/Female/Other labels
        counts = subset.gender.value_counts().reindex(gender_codes, fill_value=0)
        # A pie stays empty if the data holds no such event for this offer type
        fig.add_trace(go.Pie(labels=labels, values=counts,
                             name=f"{offer_name}-{event_name.replace(' ', '-')}"), row, col)
fig.update_traces(textfont_size=10, marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.update_layout(autosize=False, width=1000, height=1000)
fig.show()
# Create a figure instance and the three subplots
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender=='M')],ax=ax1)
ax1.set(ylabel='Count', xlabel='Age groups')
ax1.set_title("Response of males of different age groups to BOGO offer")
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender=='F')],ax=ax2)
ax2.set(ylabel='Count', xlabel='Age groups')
ax2.set_title("Response of females of different age groups to BOGO offer")
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender=='O')],ax=ax3)
ax3.set(ylabel='Count', xlabel='Age groups')
ax3.set_title("Response of people from other genders and of different age groups to BOGO offer")
plt.show()
print("What percentage of received BOGO offers was completed, by age group and gender?")
for i in ["M","F","O"]:
    bogo_offer_received = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender==i)&(temp_df2.event=='offer received')].age_range.value_counts()
    bogo_offer_completed = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender==i)&(temp_df2.event=='offer completed')].age_range.value_counts()
    print("Gender:",i)
    print((bogo_offer_completed/bogo_offer_received)*100)
# Create a figure instance and the three subplots
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender=='M')],ax=ax1)
ax1.set(ylabel='Count', xlabel='Income groups')
ax1.set_title("Response of male in different income groups to BOGO offer")
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender=='F')],ax=ax2)
ax2.set(ylabel='Count', xlabel='Income groups')
ax2.set_title("Response of female in different income groups to Buy one and get one offer")
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender=='O')],ax=ax3)
ax3.set(ylabel='Count', xlabel='Income groups')
ax3.set_title("Response of other gender in different income groups to Buy one and get one offer")
plt.show()
print("What percentage of received BOGO offers was completed, by income group and gender?")
for i in ["M","F","O"]:
    bogo_offer_received = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender==i)&(temp_df2.event=='offer received')].income_range.value_counts()
    bogo_offer_completed = temp_df2[(temp_df2.offer_type=="bogo")&(temp_df2.gender==i)&(temp_df2.event=='offer completed')].income_range.value_counts()
    print(i)
    print((bogo_offer_completed/bogo_offer_received)*100)
# Create a figure instance and the three subplots
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender=='M')],ax=ax1)
ax1.set(ylabel='Count', xlabel='Age groups')
ax1.set_title("Response of male of different age groups to discount offer")
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender=='F')],ax=ax2)
ax2.set(ylabel='Count', xlabel='Age groups')
ax2.set_title("Response of female of different age groups to discount offer")
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender=='O')],ax=ax3)
ax3.set(ylabel='Count', xlabel='Age groups')
ax3.set_title("Response of other gender of different age groups to discount offer")
plt.show()
print("What percentage of received discount offers was completed, by age group and gender?")
for i in ["M","F","O"]:
    discount_offer_received = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender==i)&(temp_df2.event=='offer received')].age_range.value_counts()
    discount_offer_completed = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender==i)&(temp_df2.event=='offer completed')].age_range.value_counts()
    print(i)
    print((discount_offer_completed/discount_offer_received)*100)
# Create a figure instance and the three subplots
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender=='M')],ax=ax1)
ax1.set(ylabel='Count', xlabel='Income groups')
ax1.set_title("Response of male in different income groups to discount offer")
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender=='F')],ax=ax2)
ax2.set(ylabel='Count', xlabel='Income groups')
ax2.set_title("Response of female in different income groups to discount offer")
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender=='O')],ax=ax3)
ax3.set(ylabel='Count', xlabel='Income groups')
ax3.set_title("Response of other gender in different income groups to discount offer")
plt.show()
print("What percentage of received discount offers was completed, by income group and gender?")
for i in ["M","F","O"]:
    discount_offer_received = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender==i)&(temp_df2.event=='offer received')].income_range.value_counts()
    discount_offer_completed = temp_df2[(temp_df2.offer_type=="discount")&(temp_df2.gender==i)&(temp_df2.event=='offer completed')].income_range.value_counts()
    print(i)
    print((discount_offer_completed/discount_offer_received)*100)
# Create a figure instance and the three subplots
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender=='M')],ax=ax1)
ax1.set(ylabel='Count', xlabel='Age groups')
ax1.set_title("Response of male of different age groups to informational offer")
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender=='F')],ax=ax2)
ax2.set(ylabel='Count', xlabel='Age groups')
ax2.set_title("Response of female of different age groups to informational offer")
sns.countplot(x = "age_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender=='O')],ax=ax3)
ax3.set(ylabel='Count', xlabel='Age groups')
ax3.set_title("Response of other gender of different age groups to informational offer")
plt.show()
print("On average across age groups, what fraction of received informational offers was viewed, by gender?")
for i in ["M","F","O"]:
    informational_offer_received = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender==i)&(temp_df2.event=='offer received')].age_range.value_counts()
    informational_offer_viewed = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender==i)&(temp_df2.event=='offer viewed')].age_range.value_counts()
    print(i)
    print(np.mean(informational_offer_viewed/informational_offer_received))
# Create a figure instance and the three subplots
fig = plt.figure(figsize=(15,15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender=='M')],ax=ax1)
ax1.set(ylabel='Count', xlabel='Income groups')
ax1.set_title("Response of male in different income groups to informational offer")
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender=='F')],ax=ax2)
ax2.set(ylabel='Count', xlabel='Income groups')
ax2.set_title("Response of female in different income groups to informational offer")
sns.countplot(x = "income_range",hue= "event",data = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender=='O')],ax=ax3)
ax3.set(ylabel='Count', xlabel='Income groups')
ax3.set_title("Response of other gender in different income groups to informational offer")
plt.show()
print("What percentage of received informational offers was viewed, by income group and gender?")
for i in ["M","F","O"]:
    informational_offer_received = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender==i)&(temp_df2.event=='offer received')].income_range.value_counts()
    informational_offer_viewed = temp_df2[(temp_df2.offer_type=="informational")&(temp_df2.gender==i)&(temp_df2.event=='offer viewed')].income_range.value_counts()
    print(i)
    print((informational_offer_viewed/informational_offer_received)*100)
df = temp_df2[(temp_df2.event=="offer completed")&(temp_df2.age<=100)]
x, y, hue = "age_range", "proportion", "gender"
hue_order = ["M", "F", "O"]
# Share of each age range among the completed offers of every gender
(df[x].groupby(df[hue]).value_counts(normalize=True).rename(y).reset_index()
    .pipe((sns.barplot, "data"), x=x, y=y, hue=hue, hue_order=hue_order))
plt.title("Age and gender wise proportion of customers who completed offers " )
plt.show()
# Encode gender as 1 for male customers and 0 otherwise
temp_merged.gender = temp_merged.gender.apply(lambda x: 1 if x == "M" else 0)
temp_merged.head(2)
temp_merged.columns
#create X and y
X = temp_merged.drop(columns=['person', 'offer_id','offer_successful',"offer_type","became_member_on"])
X = X.replace(np.nan,0)
y = temp_merged.offer_successful.astype('int')
X.head()
y.head()
For this classification project I have used classification metrics from here:
# all_in_one is a function created to split a dataset into 3 parts and to handle the repetitive tasks, namely drawing learning curves, ROC curves and model classification analysis (error analysis).
# Import basic libraries
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve, train_test_split
import time
#ignore warning messages
import warnings
warnings.filterwarnings('ignore')
#Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
# Import libraries for performance analysis (Error analysis)
from sklearn.metrics import roc_auc_score , precision_score, recall_score, f1_score, classification_report,accuracy_score,confusion_matrix,roc_curve, auc
# Split the dataset into training+validation and test sets by an 80:20 ratio.
X_train1, X_test, y_train1, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Split the remaining data again into training and validation sets by a 75:25 ratio (60:20:20 overall).
X_train, X_val, y_train, y_val = train_test_split(X_train1, y_train1, test_size=0.25, random_state=1)
print ("Training Dataset :", X_train.shape, y_train.shape)
print ("Testing Dataset:", X_test.shape, y_test.shape)
print ("Validation Dataset:", X_val.shape, y_val.shape)
# Create a dictionary of the classifiers
classifiers_dict = {'Logistic Classifier': LogisticRegression(class_weight='balanced'),
                    'Decision_Tree Classifier': DecisionTreeClassifier(class_weight='balanced'),
                    'Random_Forest Classifier': RandomForestClassifier(class_weight='balanced'),
                    #'SVM Classifier': SVC(probability=True,gamma='scale'),
                    "GaussianNB Classifier": GaussianNB(),
                    "KNN Classifiers": KNeighborsClassifier(),
                    # The default loss is the log loss, which was called 'deviance' before scikit-learn 1.1
                    "GB Classifier": GradientBoostingClassifier(),
                    "XGB Classifier": XGBClassifier(scale_pos_weight = 2)}
target_names=['Offer failed', 'Offer successful']
# Accuracy, precision, recall, F1, ROC AUC and fit/predict time of every classifier
result_columns = ['Model', 'Accuracy', 'Precision', 'Recall', 'F1', 'ROC', 'Time']
results_rows = []
for name, classifier in classifiers_dict.items():
    start = time.time()
    clf = classifier.fit(X_train, y_train)
    y_pred = clf.predict(X_val)
    end = time.time()
    # ROC AUC is computed from the predicted probability of the positive class
    roc = roc_auc_score(y_val, clf.predict_proba(X_val)[:,1])
    acc = accuracy_score(y_val, y_pred)
    prec = precision_score(y_val, y_pred)
    rec = recall_score(y_val, y_pred)
    f1 = f1_score(y_val, y_pred)
    t = end-start
    results_rows.append({'Model': name, 'Accuracy': acc, 'Precision': prec, 'Recall': rec,
                         'F1': f1, 'ROC': roc, 'Time': t})
# Build the results table once (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame(results_rows, columns=result_columns)
print(results)
print ('\n==========================================================================\n')
Note: Due to its long processing time, I have removed the SVM classifier from the comparison above.
# Receiver Operating Characteristic (ROC) curves of all the classifiers in one figure
plt.figure(figsize=(15,10))
for name, classifier in classifiers_dict.items():
    classifier.fit(X_train, y_train)
    # Predicted probability of the positive class (offer successful)
    y_score = classifier.predict_proba(X_val)[:,1]
    fpr, tpr, thresholds = roc_curve(y_val, y_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label='%s (area = %0.2f)' % (name, roc_auc))
# Diagonal reference line of a random classifier
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Chance')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve of all the classifiers')
plt.grid(True)
plt.legend(loc="lower right")
plt.show()
As we can see, the classifiers reach 100% accuracy on the validation set, so I did not go through a parameter tuning process.
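A perfect score deserves a second look before it is trusted. In this notebook, `offer_successful` is computed in `transaction_cleaning` from `offer_received`, `offer_viewed` and `offer_completed`, and those three columns are still part of `X`, so a model can read the label straight from its inputs. The sketch below illustrates the effect on a tiny made-up table; the rows and the `X_no_leak` name are hypothetical, and which columns should count as known before an offer is sent is a judgment call.

```python
import pandas as pd

# Hypothetical rows with the same indicator columns as the cleaned transcript.
df = pd.DataFrame({
    "offer_received":  [1, 1, 1, 1],
    "offer_viewed":    [1, 0, 1, 1],
    "offer_completed": [1, 1, 0, 1],
    "age":             [34, 51, 27, 62],
    "income":          [52000, 83000, 41000, 70000],
})
leaky = ["offer_received", "offer_viewed", "offer_completed"]

# The label is defined as in transaction_cleaning: received AND viewed AND completed.
df["offer_successful"] = df[leaky].all(axis=1).astype(int)

# A fixed rule on the three indicator columns reproduces the label without any learning.
rule = (df["offer_viewed"] & df["offer_completed"]).astype(int)
print((rule == df["offer_successful"]).mean())

# A feature set restricted to columns that do not encode the outcome.
X_no_leak = df.drop(columns=leaky + ["offer_successful"])
print(list(X_no_leak.columns))
```

Training on a feature set like `X_no_leak` would give a more realistic picture of how well demographics and offer attributes predict success, most likely with a score well below 100%.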
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=1)
clf = DecisionTreeClassifier(random_state=212)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
roc=roc_auc_score(y_test, y_pred)
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
rec = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
confusion_matrix(y_test, y_pred)
#print(" roc:",roc,"\n","acc:",acc,"\n","prec:",prec,"\n","rec:",rec,"\n","f1:",f1)
(y_test==1).sum()
(y_test==0).sum()
The results look good to me.
Also, one can use a Decision Tree, Random Forest, Gradient Boosting or XGBoost classifier to predict whether a new customer will complete an offer.